K-means is, without doubt, the most popular clustering algorithm. It was published decades ago, and many improvements have been made to it since.
The algorithm forms groups by minimizing the distance between each observation and the centre of its cluster; because the starting centres are random, it converges to a local optimum rather than a guaranteed global one. Distances are measured from the coordinates of the observations.
We will use the Prices of Personal Computers dataset to perform our clustering analysis. This dataset contains 6259 observations and 10 features. It records the prices of 486 personal computers in the US from 1993 to 1995. The variables include price, speed, ram, screen, and cd, among others.
library(dplyr)
D <- read.csv("Datasets/Computers.csv", header = TRUE)
df <- D %>% select(-c(X, cd, multi, premium))
head(df)
price speed hd ram screen ads trend
1 1499 25 80 4 14 94 1
2 1795 33 85 2 14 94 1
3 1595 25 170 4 15 94 1
4 1849 25 170 8 14 94 1
5 3295 33 340 16 14 94 1
6 3695 66 340 16 14 94 1
rescale_df <- df %>%
  mutate(price_scal  = scale(price),
         hd_scal     = scale(hd),
         ram_scal    = scale(ram),
         screen_scal = scale(screen),
         ads_scal    = scale(ads),
         trend_scal  = scale(trend)) %>%
  select(-c(price, speed, hd, ram, screen, ads, trend))  # note: speed is dropped, not rescaled
You rescale the variables with the scale() function (a base R function, used here inside a dplyr pipeline). The transformation reduces the impact of outliers and lets you compare a single observation against the mean. If a standardized value (or z-score) is high, you can be confident that the observation is indeed above the mean: a large z-score means the point lies far from the mean in terms of standard deviations. A z-score of two indicates the value is 2 standard deviations above the mean. Note that if the underlying variable follows a Gaussian distribution, its z-scores follow a standard normal distribution, symmetric around zero.
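To make the standardization concrete, here is a minimal sketch (using the six prices shown in head(df) above as a toy vector) confirming that scale() computes exactly (x - mean(x)) / sd(x):

```r
# scale() computes the z-score: (x - mean(x)) / sd(x)
x <- c(1499, 1795, 1595, 1849, 3295, 3695)  # the six prices from head(df)
z <- as.vector(scale(x))                    # scale() returns a matrix; flatten it
manual <- (x - mean(x)) / sd(x)
all.equal(z, manual)  # TRUE
```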
# kmeans starts from random centres; call set.seed() first for reproducible results
km <- kmeans(rescale_df, 5)
# matrix of cluster centres
km$centers
price_scal hd_scal ram_scal screen_scal ads_scal trend_scal
1 -0.5464214 -0.6352703 -0.6320622 -0.45584738 0.6881761 -0.3382442
2 -0.7586644 0.2832692 -0.3204174 0.08806206 -0.8936123 1.2198712
3 0.2679166 -0.7060987 -0.3074995 -0.24669894 -1.3212297 -1.5498077
4 0.8634498 2.0584350 2.0734266 0.63708443 -0.9224672 1.2199320
5 1.0242599 0.1953981 0.5121237 0.49673680 0.6635108 -0.3172871
# A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
km$cluster[1:30]
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# The total sum of squares.
km$totss
[1] 37548
# Vector of within-cluster sum of squares, one component per cluster.
km$withinss
[1] 3178.529 3307.672 1492.700 2374.915 5114.641
# Total within-cluster sum of squares, i.e. sum(withinss).
km$tot.withinss
[1] 15468.46
# The between-cluster sum of squares, i.e. totss-tot.withinss.
km$betweenss
[1] 22079.54
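The components above are tied together by a variance decomposition: totss always equals tot.withinss + betweenss. A quick sketch on synthetic data (standing in for rescale_df) verifies the identity:

```r
# kmeans decomposes variance: totss = tot.withinss + betweenss
set.seed(42)                        # fix the random starting centres
toy <- matrix(rnorm(200), ncol = 2)
fit <- kmeans(toy, centers = 3)
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)  # TRUE
```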
One technique to choose the best k is called the elbow method. It evaluates within-group homogeneity (equivalently, within-group heterogeneity): you look at the share of the total variance explained by the clustering. As k grows, within-cluster heterogeneity falls, so the explained variance rises, but with diminishing returns. The challenge is to find the k at the "elbow", beyond which adding a new cluster barely improves the fit because very little information is left to explain.
kmean_withinss <- function(k) {
  cluster <- kmeans(rescale_df, k)
  return(cluster$tot.withinss)
}
kmean_withinss(2)
[1] 27087.07
# Set the maximum number of clusters
max_k <- 20
# Run the algorithm over a range of k
wss <- sapply(2:max_k, kmean_withinss)
# Create a data frame for plotting
elbow <- data.frame(k = 2:max_k, wss)
library(plotly)
p <- elbow %>% ggplot(aes(x=k, y=wss)) +
geom_line() +
geom_point(size=3,color="mediumvioletred")
p <- ggplotly(p)
p
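The same elbow can be read as explained variance, the betweenss / totss ratio mentioned above: it rises with k but with diminishing returns. A sketch on synthetic two-cluster data (standing in for rescale_df, which requires the CSV file) shows the pattern:

```r
# Explained variance per k: betweenss / totss rises toward 1 with diminishing returns
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
explained <- sapply(2:6, function(k) {
  fit <- kmeans(X, k, nstart = 10)   # nstart restarts guard against poor local optima
  fit$betweenss / fit$totss
})
round(explained, 2)
```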
# Refit with the k suggested by the elbow plot
km_2 <- kmeans(rescale_df, 7)
km_2$size
[1] 228 448 539 910 458 1554 2122
center <- km_2$centers
You can create a heat map to highlight the differences between the clusters.
library(plotly)
library(reshape2)
center_df <- center %>%
  melt(varnames = c("Cluster", "Attributes")) %>%
  ggplot(aes(Attributes, Cluster, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "navy", mid = "white", high = "mediumvioletred")
center_df <- ggplotly(center_df)
center_df
K-means works on numeric variables only. When the data mix numeric and categorical features, you can instead compute a Gower dissimilarity with daisy() from the cluster package and run the k-medoids algorithm (PAM). We illustrate this on a credit dataset.
library(cluster)
library(dplyr)
library(reshape2)
library(plotly)
credit <- read.csv("Datasets/credit.csv", header = TRUE)
credit$Income <- as.factor(credit$Income)
credit$Credit_cards <- as.factor(credit$Credit_cards)
credit$Education <- as.factor(credit$Education)
credit$Car_loans <- as.factor(credit$Car_loans)
D <- credit %>% select(-Credit_rating)
D_mat <- daisy(D, metric = "gower")
kmed <- pam(D_mat, 2, diss = TRUE)
# With a dissimilarity matrix as input, the medoids are returned as row indices
kmed$medoids
[1] 373 1947
credit[kmed$medoids,]
Credit_rating Age Income Credit_cards Education Car_loans
373 0 31.92189 2 2 2 2
1947 1 34.31873 2 1 1 1
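To see what the Gower metric actually computes, here is a toy example (my own three-row data frame, not the credit file): numeric columns contribute range-normalised absolute differences, factor columns a 0/1 mismatch, and the per-variable dissimilarities are averaged.

```r
library(cluster)
# Gower distance on mixed types: numeric -> |xi - xj| / range, factor -> 0/1 mismatch
toy <- data.frame(age  = c(30, 40, 30),
                  loan = factor(c("yes", "no", "yes")))
d <- as.matrix(daisy(toy, metric = "gower"))
d[1, 3]  # rows 1 and 3 are identical, so their distance is 0
d[1, 2]  # age: |30 - 40| / 10 = 1; loan mismatch: 1; average = 1
```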
M <- data.frame(table(kmed$clustering,credit$Credit_rating))
names(M)[1:2] <- c("cluster","Credit_rating")
M
cluster Credit_rating Freq
1 1 0 888
2 2 0 132
3 1 1 745
4 2 1 699
p2 <- M %>%
ggplot(aes(x = cluster,y= Freq,fill=Credit_rating)) +
geom_bar(stat="identity", position = position_dodge2())
p2 <- ggplotly(p2)
p2
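The contingency table printed above can also be summarised as a single number: the majority-vote agreement between the two medoid clusters and the known Credit_rating. This sketch rebuilds the table from the Freq counts shown earlier:

```r
# Majority-vote agreement: each cluster is labelled with its dominant rating,
# then we count the share of observations that match (counts copied from M above)
M <- data.frame(cluster       = c(1, 2, 1, 2),
                Credit_rating = c(0, 0, 1, 1),
                Freq          = c(888, 132, 745, 699))
tab <- xtabs(Freq ~ cluster + Credit_rating, data = M)
agreement <- sum(apply(tab, 1, max)) / sum(tab)
round(agreement, 3)  # (888 + 699) / 2464 = 0.644
```

So the two clusters recover the credit rating for about 64% of the observations, noticeably better than the 50% a coin flip would give on this roughly balanced outcome.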